2025-09-16 Open-source solutions
September 16, 2025
10:45
I only found the DataCite corpus dataset; I did not find a dataset that includes the metadata.
1st place:
Mainly retrieved each dataset's title, authors, and year via API, compared them against the article's title, authors, and year, and fed the comparison features straight into CatBoost.
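A minimal sketch of this idea as I read it: build similarity features between dataset and article metadata, then train CatBoost on them. The feature set and the use of rapidfuzz are my assumptions, not from the writeup:
from rapidfuzz import fuzz
from catboost import CatBoostClassifier

def metadata_features(ds: dict, art: dict) -> dict:
    # Compare dataset vs. article title/authors/year (feature names assumed)
    return {
        'title_sim': fuzz.token_set_ratio(ds['title'], art['title']) / 100,
        'author_overlap': len(set(ds['authors']) & set(art['authors']))
        / max(1, len(ds['authors'])),
        'year_diff': abs(ds['year'] - art['year']),
    }

model = CatBoostClassifier(iterations=500, verbose=False)
# model.fit(X, y)  # X: one feature row per (article, dataset) pair; y: Primary/Secondary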
Model: Qwen2.5-Coder (32B, AWQ quantization) + vLLM. Coder was better at classification than the base model. This matches my experience: public LB 0.81 -> 0.85, private LB 0.745 -> 0.749.
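A sketch of serving the quantized model with vLLM; the repo name is the public AWQ checkpoint, and the sampling settings are placeholders, not the winner's:
from vllm import LLM, SamplingParams

# Load the AWQ-quantized Coder model into vLLM
llm = LLM(model='Qwen/Qwen2.5-Coder-32B-Instruct-AWQ', quantization='awq')
params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(['<classification prompt here>'], params)
print(outputs[0].outputs[0].text)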
5th place:
Only SAMN accessions and DOIs were classified with the LLM.
Prompt: the task instructions are placed in the user prompt, whereas I put mine directly in the system prompt.
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
### Core Instructions ###
* Inspect WINDOW taking particular interest in ID, given below.
* The ID is specifically a data citation - that relates to data held in an open-access repository.
* Determine whether the WINDOW context holds evidence that the WINDOW authors are responsible for the ID held in the public repository.
* After thinking, give your final answer using the rubric:
* Owner: the WINDOW authors have some sort of ownership around ID.
* User: the data has been re-used/referenced/compared in the WINDOW.
* None: there is no evidence to determine ownership.
* When reviewing the METADATA remember:
* The METADATA is collected from several sources and hence has various formats for author names and dates.
* The most important thing is finding the overlap of WINDOW author(s) with the ID author(s); usually one author overlap is enough to assume Owner.
* The final answer should be wrapped in \boxed{} containing only User, Owner or None.
# ID
https://doi.org/10.17882/47142
# METADATA
## ID METADATA
[Title]: A global bio-optical database derived from Biogeochemical Argo float measurements within the layer of interest for field and remote ocean color applications
[Authors]: Organelli, Emanuele; Barbieux, Marie; Claustre, Hervé; Schmechtig, Catherine; Poteau, Antoine; Bricaud, Annick; Uitz, Julia; D'Ortenzio, Fabrizio; Dall'Olmo, Giorgio
## WINDOW METADATA
[Title]: Assessing the Variability in the Relationship Between the Particulate Backscattering Coefficient and the Chlorophyll a Concentration From a Global Biogeochemical-Argo Database
[Authors]: Marie Barbieux; Julia Uitz; Annick Bricaud; Emanuele Organelli; Antoine Poteau; Catherine Schmechtig; Bernard Gentili; Grigor Obolensky; Edouard Leymarie; Christophe Penkerc'h; Fabrizio D'Ortenzio; Hervé Claustre
[Date]: 2018-2
# WINDOW
## Paragraph
<p xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><s>Sherbrooke, Canada) are acknowledged for useful comments and fruitful discussion.</s><s>We also thank the International Argo Program and the CORIOLIS project that contribute to make the data freely and publicly available.</s><s>Data referring to <ref type="bibr">(Organelli et al., 2016a)</ref> (doi:10.17882/47142)</s><s>and <ref target="#b8" type="bibr">(Barbieux et al., 2017)</ref> (doi: 10.17882/49388) are freely available on SEANOE.</s></p>
## References Condensed
<biblstruct xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:id="b100">
<monogr>
<title level="j">SEANOE</title>
<imprint>
<date type="published" when="2016">2016</date>
</imprint>
</monogr>
<note type="raw_reference">Organelli, E., M. Barbieux, H. Claustre, C. Schmechtig, A. Poteau, A. Bricaud, J. Uitz, F. D'Ortenzio, and G. Dall'Olmo (2016a), A global bio-optical database derived from Biogeochemical Argo float measurements within the layer of interest for field and remote ocean colour applications, SEANOE, doi:10.17882/47142.</note>
</biblstruct><|im_end|>
<|im_start|>assistant
import polars as pl

# Post-process the LLM completions: strip the chain-of-thought before
# </think>, pull the \boxed{...} answer, and map it to a citation type.
# is_doi, P (Primary) and S (Secondary) are expressions defined elsewhere.
llm_response = (
    pl.read_parquet('/kaggle/working/llm_out.pq')
    .with_columns(
        cot=pl.col('completions').str.split('</think>').list.first(),
        ans=pl.col('completions').str.split('</think>').list.last(),
    )
    .with_columns(
        # match "oxed{...}" to avoid escaping the backslash in \boxed{...}
        pl.col('ans').str.extract(r'oxed\{(.*)\}').alias('ans')
    )
    .with_columns(
        # unwrap answers emitted as \text{...}
        pl.when(pl.col('ans').str.starts_with('\\'))
        .then(pl.col('ans').str.extract(r'ext\{(.*)\}'))
        .otherwise(pl.col('ans'))
        .alias('ans')
    )
    .with_columns(
        pl.when(is_doi.and_((pl.col('ans') == 'None').or_(pl.col('ans').is_null())))
        .then(S)
        .when((~is_doi).and_((pl.col('ans') == 'None').or_(pl.col('ans').is_null())))
        .then(S)
        .when(pl.col('ans') == 'Owner')
        .then(P)
        .otherwise(S)
        .alias('type')
    )
    # .join(
    #     get_ground_truth().filter(pl.col('type') != 'Missing'),
    #     on=['article_id', 'dataset_id'], how='left',
    # )
    .select('article_id', 'dataset_id', 'type')  # 'ans', 'type_right'
)
2nd place
On the question of why some accession IDs don't count as datasets:
E.g. in many cases, some accession numbers from the same table and repository were picked while others were not. The speculation was that some kind of NER model is being used, and its threshold leaves out some relevant accession numbers.
Type classification for DOIs, purely rule-based (a hedged sketch follows the list):
Accession: SAMN and EMDB -> Primary
DOI: if the dataset is found in multiple papers per the DataCite corpus, tag the first as Primary and the rest as Secondary after sorting by publicationDate (why didn't I think of this?)
DOI: if article_id isSupplementTo dataset_id per the DataCite public data file -> Primary
DOI: if more than 4 occurrences of the same repo (first 4 letters) or more than 4 DOIs mentioned around the dataset in the article -> Secondary
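A minimal sketch of these rules, assuming hypothetical helpers (is_supplement_to, citing_papers) over the DataCite public data file and precomputed counts of repo prefixes and nearby DOIs:
def classify_by_rules(id_str, article_id, nearby_dois, repo_prefix_counts):
    if id_str.startswith(('SAMN', 'EMD-')):  # SAMN / EMDB accessions
        return 'Primary'
    if is_supplement_to(article_id, id_str):  # isSupplementTo relation
        return 'Primary'
    papers = citing_papers(id_str)  # papers citing this DOI per the DataCite corpus
    if len(papers) > 1:
        first = min(papers, key=lambda p: p['publicationDate'])
        return 'Primary' if first['id'] == article_id else 'Secondary'
    # many DOIs from the same repo prefix, or many DOIs nearby -> Secondary
    if repo_prefix_counts.get(id_str[:4], 0) > 4 or len(nearby_dois) > 4:
        return 'Secondary'
    return None  # fall through to the model-based classifier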
Classification: two team members used different approaches.
First member: Mohsin - DOI classification
Most time was spent on creating the context for classification (<2048 tokens). I mapped dataset authors and abstracts from the DataCite public data file. Then I collated the context by extracting the following parts of the paper:
Model -> MedGemma-4B LoRA
Second member: Classification Models - Nikhil, who trained a BERT model for classification.
Data Sources
Training Labels:
Model Architecture
Base Model
microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
Stabilization Techniques
1. Token Replacement Strategy
2. Hyperparameter Configuration
3. Threshold Optimization
Context Construction Example: besides the context, they also added the mention count and the repository type (a sketch follows the example).
DOI Context = [Mention Token] + [Repository Token] + [Original Context]
Example:
"More than 3 mentions" + "Zenodo Repository" + [150 chars context]
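A small sketch of this construction; the exact token strings are my guess at the format:
def build_doi_context(n_mentions: int, repo: str, context: str, max_chars: int = 150) -> str:
    # DOI Context = [Mention Token] + [Repository Token] + [Original Context]
    mention_token = 'More than 3 mentions' if n_mentions > 3 else f'{n_mentions} mentions'
    repo_token = f'{repo} Repository'
    return ' '.join([mention_token, repo_token, context[:max_chars]])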
Key Success Factors
Model Separation: Treating accession IDs and DOIs as distinct problems
Context Optimization: Different window sizes for different ID types
Feature Engineering: Leveraging metadata (mentions, repository)
Stability Focus: Large batch size and multi-seed averaging
Token Strategy: Replacing irrelevant IDs to reduce noise
3rd place
Dataset type classification
At the very beginning we knew that type classification was one of the most important parts of this competition. The reason is that, since the metric is F1, an FN or an FP in the retrieval part counts as only a single error. However, if we get a TP sample but misclassify its type, we get 2 errors: 1 FN for the missing correct type and 1 FP for the wrongly predicted type.
Our solution here consists of two main steps: a 6-fold DeBERTa-v3 ensemble and some heuristics.
Some rules (similar to 2nd place)
Similarly to other teams, we first used a few rules that proved to work on both LB and CV:
The remaining citations were then classified using a Deberta-v3 Large ensemble.
Training details
We created a 6-fold StratifiedGroupKFold (stratified by type and grouped by article_id) and trained one deberta-v3-large on each of those folds using a binary classification setup (classification head on top). The following features were used to generate the training prompts:
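A minimal sketch of the fold setup, assuming df is a pandas DataFrame with 'type' and 'article_id' columns:
from sklearn.model_selection import StratifiedGroupKFold

sgkf = StratifiedGroupKFold(n_splits=6, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(
    sgkf.split(df, y=df['type'], groups=df['article_id'])
):
    pass  # train one deberta-v3-large per fold on df.iloc[train_idx]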
Two tricks to make training more stable!
We found two very important tricks to stabilize the training: gradient clipping and model/weight EMA. The latter is a very simple technique: you keep a trainable model updated via SGD/Adam and a frozen counterpart that is updated via a direct (weighted) average of both models' weights after each training step. Using the transformers library, it can easily be implemented by inheriting from the Trainer class as follows:
from transformers import Trainer
from ema_pytorch import EMA

class EMATrainer(Trainer):
    def __init__(self, ema_decay=0.9995, ema_update_every=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialize EMA lazily, once the model is set
        self.ema_decay = ema_decay
        self.ema_update_every = ema_update_every
        self.ema = None

    def _setup_ema(self):
        if self.ema is None:
            self.ema = EMA(
                self.model,
                beta=self.ema_decay,
                update_every=self.ema_update_every,
                update_after_step=50,  # start EMA after 50 steps
            )

    def training_step(self, model, inputs, num_items_in_batch=None):
        """Override training step to include EMA updates."""
        if self.ema is None:
            self._setup_ema()
        # Perform the normal training step
        loss = super().training_step(model, inputs, num_items_in_batch=num_items_in_batch)
        # Update the EMA weights after each step
        self.ema.update()
        return loss

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        """Evaluate using the EMA model."""
        if self.ema is not None:
            # Temporarily swap in the EMA model for evaluation
            original_model = self.model
            self.model = self.ema.ema_model
            results = super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)
            # Restore the original model
            self.model = original_model
            return results
        else:
            return super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)

    def save_model(self, output_dir=None, _internal_call=False):
        """Save the EMA model (falls back to the raw model if EMA is unset)."""
        ema_output_dir = f"{output_dir}/ema_model"
        if self.ema is not None and output_dir is not None:
            self.ema.ema_model.save_pretrained(ema_output_dir)
        else:
            self.model.save_pretrained(ema_output_dir)
On the difference between the moving average in model EMA and the moving averages in AdamW:
The moving averages in AdamW track statistics of the gradients, whereas the moving average in model EMA is over the parameters themselves.
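A minimal sketch contrasting the two update rules (illustrative helper functions of my own, not from the writeup):
import torch

def adamw_moment_update(m, v, grad, beta1=0.9, beta2=0.999):
    # AdamW keeps exponential moving averages of *gradient* statistics
    m = beta1 * m + (1 - beta1) * grad        # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: uncentred variance
    return m, v

@torch.no_grad()
def model_ema_update(ema_params, model_params, beta=0.9995):
    # Model EMA keeps an exponential moving average of the *parameters*
    for p_ema, p in zip(ema_params, model_params):
        p_ema.mul_(beta).add_(p, alpha=1 - beta)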
4th place (has code! Worth a look, including the agent demo)
Type Classification: I really should have fine-tuned on accession types as well.
We trained LLM classifiers that handled both the DOI and accession subsets without separate models for each. Training a robust model was tricky because the competition dataset is small and noisy. We adopted a few strategies to address these challenges:
Inference
Links:
9th place
Used an LLM to extract authors and other metadata from the PDF-to-text output.
Used embedding similarity to decide whether a citation is primary (a sketch follows).
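A hedged sketch of this idea: embed the paper and dataset metadata and threshold their cosine similarity. The model name and the 0.6 threshold are my placeholders:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def looks_primary(paper_meta: str, dataset_meta: str, threshold: float = 0.6) -> bool:
    # High metadata similarity suggests the paper's authors created the dataset
    emb = model.encode([paper_meta, dataset_meta], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() > threshold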
11th place
I also split the main text and the references, but why didn't I think of matching on author information, or of using an LLM to extract the author info?
Part 2: DOI Classification (Primary vs. Secondary)
We used two distinct methods for classifying DOIs:
12th place
Type classification
DOIs
LLM-based classification leveraging context and metadata.
system = """You are given (1) an article snippet (Context) and (2) a candidate dataset identifier (DOI) with metadata for both the paper and the dataset.
Decide whether the dataset is used as:
A) Primary — data generated by/for this study
B) Secondary — reused/derived from prior work or previously published dataset
Use BOTH:
• Context: the article snippet discussing data usage
• Metadata similarity: closeness between paper vs. dataset (titles, abstracts, author overlap, topics)
Reply with ONLY one letter: A or B.
"""
user = (
f"Identifier (DOI): {to_str(dsid)}\n"
f"Features: n_citations={to_str(n_cit)}, "
f"is_first_publication={str(bool(row.get('is_first_publication'))).lower()}, "
f"citations_before={to_str(cb)}, "
f"elapsed_days_from_dataset_publication={to_str(dd)}\n\n"
f"=== Paper Metadata ===\n"
f"Title: {to_str(row.get('paper_title'))}\n"
f"Authors: {to_str(row.get('paper_author_name'))}\n"
f"Abstract: {to_str(row.get('paper_abstract'))}\n\n"
f"=== Dataset Metadata ===\n"
f"Title: {to_str(row.get('dataset_title'))}\n"
f"Authors: {to_str(row.get('dataset_author_name'))}\n"
f"Abstract: {to_str(row.get('dataset_abstract'))}\n\n"
f"=== Context (article snippet) ===\n{to_str(row.get('chunk'))}\n"
)
The addition of metadata contributed the most to performance improvement (DOI-only LB: 0.333 → 0.345).
My fine-tuned model only reached 0.314, which shows how useful the metadata is.
15th place
Finally, I used Grobid to extract the author list for each paper and determined primary vs. secondary authors based on surnames, using an LLM in a multi-stage decision process.
Grobid is an open-source machine learning library designed specifically for processing scholarly literature. It can:
Extract document metadata (title, authors, abstract, etc.)
Identify and parse document structure (sections, paragraphs, figures/tables, etc.)
Extract citation information and references
Convert PDFs into structured XML or JSON
Grobid is particularly good at scholarly papers and can accurately identify each part of a paper, including the introduction, methods, results, and discussion sections. A minimal call example follows.
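A minimal sketch of calling a local Grobid server's REST API, assuming the default port 8070 and Grobid's standard full-text endpoint:
import requests

with open('paper.pdf', 'rb') as f:
    resp = requests.post(
        'http://localhost:8070/api/processFulltextDocument',
        files={'input': f},
    )
resp.raise_for_status()
tei_xml = resp.text  # TEI XML with header metadata, body structure, and references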
20th place
Accession IDs